Main Analysis (Exploratory Data Analysis)
Our research can be divided into three sections: General questions about the auction information, understanding fluctuations in lot prices and its correlation with different variables
General Auction Information
We started by asking many questions about possible relationships between variables. The first set of plots will explore the correlation between number of lots sold and year, location and season. We were hoping to notice meaningful trends that can be further explored in subsequent sections. Since we had only a few locations, seasons and years we chose a bar chart and excluded duplicate rows by the ‘number of lots’ column.
library(gridExtra)
##Lots by location
art_info <- subset(art_final, select=c( "location", "number_of_lots" ))
art_info <- art_info[!duplicated(art_info$number_of_lots),]
art_group <- art_info %>% group_by(location)%>% summarise(B=sum(number_of_lots))
p1 <- ggplot(art_group, aes(x= location, y = B)) +
geom_bar( stat='identity', color="royalblue", fill="grey") +labs(x = "Location")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by location") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5,face="bold"))
art_info2 <- subset(art_final, select=c( "auc_year", "number_of_lots" ))
art_info2 <- art_info2[!duplicated(art_info2$number_of_lots),]
art_info2 <- art_info2 %>% group_by(auc_year)%>% summarise(B=sum(number_of_lots))
p2 <- ggplot(art_info2, aes(x= auc_year, y = B)) +
geom_bar( stat='identity', color="royalblue", fill="grey") +labs(x = "Auction year")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by year") + theme_minimal()+ theme(plot.title = element_text(hjust = 0.5,face="bold"))+ scale_x_continuous(breaks= c(2006, 2010, 2014, 2017))
grid.arrange(p1, p2, nrow = 1)
##Lots by season
art_info <- subset(art_final, select=c( "auc_season", "number_of_lots" ))
art_info <- art_info[!duplicated(art_info$number_of_lots),]
art_group <- art_info %>% group_by(auc_season)%>% summarise(B=sum(number_of_lots))
ggplot(art_group, aes(x= auc_season, y = B)) +
geom_bar( stat='identity', color="royalblue", fill="grey") +labs(x = "Auction season")+labs(y = "Total number of lots") + ggtitle("Number of lots sold by season") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5,face="bold"))
The graphs above suggest that fall and spring are the most popular seasons to acquire a masterpiece. At the same time, there was a significant increase in number of lots sold since 2010, suggesting that investors started to see the art market as a form of long-term non-liquid investment after the financial crisis of 2008 (we will explore this hypothesis in more detail later in the analysis). The main hubs for art exchanges formed in London, New York and Paris.
The next set of graphs will focus on the number of actions by year, location and season.
##Auctions by location, year, season
art_info <- subset(art_final, select=c( "auction_id", "location" ))
art_info <- art_info[!duplicated(art_info$auction_id),]
art_group <- art_info %>% group_by(location)%>% count(auction_id) %>% summarise(B=sum(n))
p1<-ggplot(art_group, aes(x= location, y = B)) +
geom_bar( stat='identity', color="royalblue", fill="grey") +labs(x = "Auction location")+labs(y = "Number of auctions") + ggtitle("Number of auctions by location") + theme_minimal()+ theme(plot.title = element_text(hjust = 0.5,face="bold"))
art_info2 <- subset(art_final, select=c( "auction_id", "auc_year" ))
art_info2 <- art_info2[!duplicated(art_info2$auction_id),]
art_info2 <- art_info2 %>% group_by(auc_year)%>% count(auction_id) %>% summarise(B=sum(n))
p2<- ggplot(art_info2, aes(x= auc_year, y = B)) +
geom_bar( stat='identity', color="royalblue", fill="grey")+labs(y = "Number of auctions")+labs(x = "Year") + theme_minimal() + ggtitle("Number of auctions by year") + theme(plot.title = element_text(hjust = 0.5,face="bold"))
grid.arrange(p1, p2, nrow = 1)
##Auctions by season
art_info <- subset(art_final, select=c( "auction_id", "auc_season" ))
art_info <- art_info[!duplicated(art_info$auction_id),]
art_group <- art_info %>% group_by(auc_season)%>% count(auction_id) %>% summarise(B=sum(n))
ggplot(art_group, aes(x= auc_season, y = B)) +
geom_bar( stat='identity', color="royalblue", fill="grey") +labs(x = "Auction season")+labs(y = "Number of auctions") + ggtitle("Number of auctions by season") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5,face="bold"))
Plotting the number of auctions based on location, season and year turned out to have similar results to the previous section and confirmed ideas suggested above. Namely, fall and spring are the two main auction seasons. London, New York and Paris are the main locations for art trading while New York has more lots sold and London has more auctions overall. And after 2009 the number of auctions conducted yearly around the world increased significantly.
To answer our second set of questions and evaluate the different triggers of variability in art prices we decided to explore fluctuations in Hammer Prices. Therefore, we created a histogram with density curve to visualize the distribution of Hammer Price.
##Hammer Price
ggplot(art_final, aes(x= hammer_price_bp_usd)) + geom_histogram(aes(y=..density..)) + geom_density(col="red") + xlim(0,1500000) + theme_minimal() +xlab("Price") +ylab("Density") +ggtitle("Hammer Price Distribution")+ theme(plot.title = element_text(hjust = 0.5,face="bold"))
Distribution of Hammer Price is skewed to the right and has a long tail. We believe the reason for this is various price ranges for different groups of lots sold on the market. Variability in price can be explained by difference in art genres (Contemporary vs Renaissance for example), quality, age and popularity of the artwork. These factors can distinguish a masterpiece sold for $135 million like “Portrait of Adele Bloch-Bauer” by Gustav Klimt and Untitled painting by Mark Rothko sold for only 6.5 million.
Constructing valuable models in the next parts of our research fully depend on the ability to manipulate hammer price in the right way. We decided to remove the outliers.
art_final <-art_final%>%
filter(abs(art_final$hammer_price_bp_usd -
median(art_final$hammer_price_bp_usd)) <=3*sd(art_final$hammer_price_bp_usd))
ggplot(art_final, aes(x= hammer_price_bp_usd)) + geom_histogram(aes(y=..density..)) + geom_density(col="blue") + xlim(0,1500000)+ theme_minimal() +xlab("Price") +ylab("Density") +ggtitle("Hammer Price Distribution (New)")+ theme(plot.title = element_text(hjust = 0.5,face="bold"))+scale_y_continuous(labels = scales::comma)
Adjusted hammer price brought our attention to distribution of revenue over time and location. For both plots we chose a bar chart.
##revenue by location
MyData <- subset(art_final, select=c( "location", "auction_id", "hammer_price_bp_usd" ))
MyData5 <- MyData %>% group_by(location)%>% summarise(B=sum(hammer_price_bp_usd))
p1 <- ggplot(MyData5, aes(x= location, y = B)) +
geom_bar( stat='identity', color="yellow", fill="grey")+labs(y = "Auction Revenue")+labs(x = "Location") + ggtitle("Auction revenue by location") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5,face="bold"))+scale_y_continuous(labels = scales::comma)
##revenue by year
MyData_1 <- subset(art_final, select=c( "auc_year", "auction_id", "hammer_price_bp_usd" ))
MyData_1 <- MyData_1 %>% group_by(auc_year)%>% summarise(B=sum(hammer_price_bp_usd))
p2<-ggplot(MyData_1, aes(x= auc_year, y = B)) +
geom_bar(stat='identity', color="green", fill="grey")+labs(y = "Auction Revenue")+labs(x = "Year") + ggtitle("Auction revenue by year") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5,face="bold"))+scale_y_continuous(labels = scales::comma)
grid.arrange(p1, p2, nrow = 1)
After the financial crisis world art revenues went up following the assumption that people saw art as a form of investment. In 2013 revenues declined again possibly due to a slow down in the art market and went back up in the following years. London and New York continue to lead the way as the main centers for the exchange of art.
Another question that we thought would help us explore the data: Variability of lots across auctions?
In the data we had 89 auctions varying in theme, concept and length. We were particularly interested in the genres of auctions that had the largest number of lots. We created a Cleveland dot plot and filtered by auctions that had more than 200 lots.
##Cleveland dot plot
MyData2 <- strtrim(art_final$auc_title, 40)
art_final$auc_title <- MyData2
MyData1 <- art_final[!duplicated(art_final$auc_title),]
MyData3 <- subset(MyData1, select=c("auc_title", "number_of_lots"))
MyData3$auc_title <- factor(MyData3$auc_title, levels = MyData3$auc_title[order(MyData3$number_of_lots)])
MyData3<- MyData3[which(MyData3$number_of_lots>200),]
ggplot(MyData3, aes( x = auc_title, y = number_of_lots)) + geom_point(stat="identity", color="red") + coord_flip()+ theme_minimal() +labs(y = "Number of lots by auction")+labs(x = "Number of lots") + ggtitle("Auction title") + theme(plot.title = element_text(hjust = 0.5,face="bold"))
Based on the graph we identified the theme of each auction and calculated shares of different themes. The result showed that 48% of the auctions with the highest number of lots are Contemporary auctions. This results can be easily explained since modern artists create more and more content every year while for older masters artworks are main resold (nothing new is generated).
To conclude exploration of data we created a parallel coordinate plot for auction year, birth year of the artist, lot number in the auction, and hammer price colored by location.
MyDatas <-subset(art_final, select=c( "percent_in_auction", "auc_year" , "birth_year", "hammer_price_bp_usd", "location"))
MyDatas<- MyDatas %>%
filter(birth_year>1400)
#devtools::install_github("timelyportfolio/parcoords")
library(parcoords)
#library(httpuv)
parcoords(MyDatas,
rownames = F
, brushMode = "1D-axes", alpha =0.5,color = list(
colorBy = "location", colorScale = htmlwidgets::JS("d3.scale.category10()"))
)
From this chart, we can see that most of New York’s lots have author born after 1900. However, other than that, we can’t read much valuable inforamtion in this graph as the correlations are not strong enough and lines are overlapping.
Do Certain Lot Attributes Result in Higher Price?
Lot Titles
What lots have higher price?
To begin the analysis, we want to see the titles of expensive lots and plot their size with respect to the prices.
df1 <- art_final
df_wordcloud <- df1[,c("lot_title","hammer_price_bp_usd")]
df_wordcloud <- arrange(df_wordcloud,desc(df_wordcloud$hammer_price_bp_usd))[1:1000,]
library(wordcloud)
library(tm)
pal <- brewer.pal(9, "OrRd")
pal <- pal[-(1:3)]
wordcloud(df_wordcloud$lot_title, min.freq = 10,df_wordcloud$hammer_price_bp_usd, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)
A lot of expensive lots appear to have “untitled” in their title. To get a better idea of what words artists tend to name their pieces, we count the frequency of each word appear in all titles and see which words do artists tend to use in the titles.
What words appear more often in the lot titles?
# collapse the lot_title column by word and count the frequency they appear in the titles.
temp <- paste(df1$lot_title, collapse=' ' )
temp <- tolower(temp)
temp <- gsub(" *\\b[[:alpha:]]{1}\\b *", " ", temp)
temp <- gsub('[[:punct:] ]+',' ',temp)
temp <- as.list(strsplit(temp, " "))
temp <- unlist(temp)[!(unlist(temp) %in% stopwords("english"))]
temp <- unlist(temp)[!(unlist(temp) %in% "na")]
word_count <- na.omit(as.data.frame(table(temp)))
word_count <- arrange(word_count,desc(word_count$Freq))[1:300,]
# visualize word frequencies
pal <- brewer.pal(9, "Dark2")
wordcloud(word_count$temp, word_count$Freq, min.freq =20, scale=c(5, .5), random.order = FALSE, random.color = FALSE, colors= pal)
Without surprise, “untitled” indeed are the most popular word artists tend to use. In that case, are “untitled” works more likely to receive higher prices?
Are “untitled” works mostly contemporary?
untitle_ratio = read.csv("untitle_ratio.csv", header=TRUE)
#tempstr <- strtrim(untitle_ratio$auc_title, 20)
#untitle_ratio$auc_title <- tempstr
untitle_ratio$is_contemporary = as.factor(untitle_ratio$is_contemporary)
ggplot(data=untitle_ratio, aes(x=reorder(auc_title,ratio), y=ratio,fill=is_contemporary))+geom_bar(stat="identity")+coord_flip() + xlab("Auction Title") + ylab("Ratio") +ggtitle("Auction Titles Based on their Ratio of Occurance")
Of the top 20 auctions that have the highest “untitled” ratio, we see that contemporary auctions are just slightly greater than non-contemporary auctions. In fact, by looking at auction names of non-contemporary auctions, we realize that smaller pieces of art (such as decoration, furnitures) tend to have “untitled” in their title as well! Perhaps those contribute mostly to the fact that “untitled” work have relatively lower prices.
Does the era of the lot affect its price?
Here, we will plot the hammer price against artists’ birth years. Note that we are using log scales on the y-axis as the range of prices is wide.
df3 <- df1 %>%
filter(df1$birth_year>1700)
ggplot(df3, aes(birth_year,hammer_price_bp_usd)) +
geom_smooth(method='lm',formula=y ~ poly(x, 3),color="orange")+
geom_point(alpha = .1) +
theme_bw()+scale_y_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x)))+geom_density_2d(bins = 5, color="orange")+
scale_x_continuous(breaks = seq(min(df3$birth_year), max(df3$birth_year),10))+
theme(axis.text.x = element_text(angle = 45, hjust = 1))+
xlab("Author Birth Year")+
ylab("Hammer Price")+
ggtitle("Lot Hammer Price vs Author Birth Year")
From the above chart, we can see that modern pieces have larger variance. We’d like to try a box plot to capture this feature.
#box plot price vs year
ggplot(df3, aes(auth_era, hammer_price_bp_usd)) +
geom_boxplot()+theme(axis.text.x = element_text(angle = 90, hjust = 1))+scale_y_log10() +xlab("Author's Era") +ylab("Price") +ggtitle("Author's Era vs Price of Their Work")
We notice that for certain eras (e.g. 1750), there are no pieces sold. There are also more outliers for authors born 1900 - 1980. However, due to the complex change in average price over every 10 years, this box plot looks messy overall. Therefore, we decided to only include the scatter plot as that contains the information more straightforward.
Do Certain External Factors Result in Higher Price?
Does the Order Matter?
In general, there are many lots in each auction. In our dataset, the average number of lots per auction is 357. With the large amount of lots being auctioned, we assume that the most valuable pieces get presented early in the auction. To validate this idea, we normalized the order of lots presented in each auction and plot it against the hammer price.
ggplot(df1, aes(percent_in_auction,hammer_price_bp_usd)) +
geom_smooth(method='lm',formula=y~x,color="red")+
geom_point(alpha = .05) +
theme_grey(10)+scale_y_log10()+
facet_wrap(~location)
Is there an impact from the financial crisis?
Let’s start by looking at the average lot prices of Sotheby’s on a monthly scale.
df1$auc_ymd <- as.Date(df1$auc_year_month_date)
art_yearfin <- df1 %>% group_by(month=lubridate::floor_date(auc_ymd, "month")) %>% summarise(revenue = mean(hammer_price_bp_usd))
ggplot() +geom_line(data=art_yearfin, aes(x=month, y=revenue/1000000))+ggtitle("Financial Crisis' Effect on average lot price ($M)")+ylab("Average Price ($M)")+xlab("Year")+theme(axis.text.x = element_text(angle = 45, hjust = 1))+scale_y_log10() +theme_bw()
It appears that the monthly scale is not very helpful in determining the effect because of the different fluctuations around each auction. Perhaps looking at the average price with a little less granularity will help? Since the monthly scale appears to have too much “noise,” we redrew the graph looking at a yearly scale instead. Our guess was that we are supposed to see a significant drop around the time of the financial crisis.
art_finance <- art_final[c("auc_year", "auc_month", "location","hammer_price_bp_usd")] %>% filter(!is.na(hammer_price_bp_usd))
art_yearfin <- art_finance %>% group_by(auc_year) %>% summarise(av_revenue =mean(hammer_price_bp_usd))
ggplot(art_yearfin, aes(x=auc_year, y=as.numeric(av_revenue/1000000))) +geom_line(col="darkblue")+ggtitle("Financial Crisis' Effect on Average Auction Price ($M)")+ylab("Average Price ($M)")+xlab("Year") +theme_bw()+scale_x_continuous(breaks = seq(min(art_final$auc_year), max(art_final$auc_year),1))
Indeed, we observe a big dip around late 2010. It may be surprising to see that it took some time for the effect to reach the auction houses. Nonetheless, when thinking back on the global time line of the Financial Crisis’ effect, it did take a few years for it to reach other industries outside of finance.
Next, let us explore the same effect on the total revenue and see if the same pattern holds.
art_finance <- art_final[c("auc_year", "auc_month", "location","hammer_price_bp_usd")] %>% filter(!is.na(hammer_price_bp_usd))
art_yearfin <- art_finance %>% group_by(auc_year) %>% summarise(revenue = sum(hammer_price_bp_usd))
ggplot(art_yearfin, aes(x=auc_year, y=as.numeric(revenue/1000000))) +geom_line(col="darkblue")+ggtitle("Financial Crisis' Effect on Auction Revenue ($M)")+ylab("Revenue ($M)")+xlab("Year")+theme_bw()+scale_x_continuous(breaks = seq(min(art_final$auc_year), max(art_final$auc_year),1))
We can easily tell that the two graphs are very similar and that total revenue did experience the same drop. Perhaps there are certain locations that are skewing some of the data. Let us try to facet the data by location and see if that could present us with a better outlook. We will be excluding Dubai and Doha due to lack of data for hammer prices (as shows in the data quality part).
art_locfin <- art_finance %>% group_by(auc_year, location) %>% summarise(av_revenue = mean(hammer_price_bp_usd)) %>% filter(location %in% c("AMSTERDAM","HONG KONG","NEW YORK", "LONDON", "PARIS", "MILAN"))
ggplot(art_locfin, aes(x=auc_year, y=as.numeric(av_revenue/1000000))) +geom_line(col="darkblue")+ggtitle("Financial Crisis' Effect on Average Auction Price ($M)")+ylab("Average Price ($M)")+xlab("Year") + facet_wrap(~location, scales ="free_y")+theme_bw()+scale_x_continuous(breaks = seq(min(art_final$auc_year), max(art_final$auc_year),1))+theme(axis.text.x = element_text(angle = 45, hjust = 1))
We notice a consistent drop in revenue across all locations above starting 2010. Therefore, our original hypothesis seems correct: the financial crisis did have an effect on the auction revenue across the world (specifically significant drops are observed in New York and Hong Kong).
Does Season Matter?
df1 <- df1 %>%
dplyr::mutate(auc_season = forcats::fct_relevel(auc_season, "summer")) %>%
dplyr::mutate(auc_season = forcats::fct_relevel(auc_season, "fall")) %>%
dplyr::mutate(auc_season = forcats::fct_relevel(auc_season, "winter"))
vcd::mosaic(hammer_price_bp_usd_range~auc_season, direction = c("v", "h"),df1,
gp = gpar(fill = c("yellow", "purple")),
labeling = labeling_border(rot_labels = c(0, 90),pos_labels="center"))
The labels of hammer price range did not turn out as neat in the beginning, however, after changing the order of the seasons, we obtained the above graph indicating the clear correlations between seasons and the lot prices. Specifically, there are more lots sold in spring and fall and lots are sold relatively higher in the fall.
Does Location Matter?
To explore which locations have higher average price we created a bar chart.
MyData_2a <- subset(art_final, select=c( "location", "hammer_price_bp_usd" ))
MyData_2b <- MyData_2a %>% group_by(location)%>% summarise(B=mean(hammer_price_bp_usd))
ggplot(MyData_2b, aes(x= location, y = B)) +
geom_bar(stat='identity', color="blue", fill="grey")+labs(y = "Average price")+labs(x = "Location") + ggtitle("Average price by location") + theme_minimal() + theme(plot.title = element_text(hjust = 0.5,face="bold"))
From the graph it can be seen that Hong Kong, London and New York have the highest average lot price. Which confirms our assumption from part 1 that New York and London are the main hubs for art exchange.